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Abstract 

This paper concerns the robust regression model when the number of 
predictors and the number of observations grow in a similar rate. The¬ 
ory for M-estimators in this regime has been recently developed by several 
authors [El Karoui et al., 2013, Bean et al., 2013, Donoho and Montanari, 
2013]. Motivated by the inability of M-estimators to successfully estimate 
the Euclidean norm of the coefficient vector, we consider a Bayesian frame¬ 
work for this model. We suggest a two-component mixture of normals prior 
for the coefficients and develop a Gibbs sampler procedure for sampling 
from relevant posterior distributions, while utilizing a scale mixture of nor¬ 
mal representation for the error distribution . Unlike M-estimators, the pro¬ 
posed Bayes estimator is consistent in the Euclidean norm sense. Simulation 
results demonstrate the superiority of the Bayes estimator over traditional 
estimation methods. 


1 Introduction 

When fitting a linear regression model to data, estimators robust to outliers are 
often desired. One popular approach to achieve robustness to outliers is to use 
M-estimators under a penalty function other than the quadratic one. This method¬ 
ology is often termed as “robust regression”. Classical results for robust regression 
are that the M-estimator of the coefficients vector is consistent and normally dis¬ 
tributed [see Huber, 2011, Chap. 7]. These results were obtained for the case p, 
the number of predictors, is fixed or grows slowly with the number of observations, 
n. The case where p grows faster than n have been drawing a lot of a attention. 
In that scenario, a popular approach is to consider penalization based estimation 
methods, e.g., the Lasso [Tibshirani, 1996]. 

We consider a different scenario. Assume that p < n, yet p grows at the same 
rate as n. That is , p/n k for some positive constant k < 1. This scenario was 
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first recognized as an interesting one by Huber [1973]. It is, however, only with 
the emergence of “big data” that researchers have begun to investigate the robust 
regression model under this regime. Asymptotic distribution and variance calcu¬ 
lations were recently developed for M-estimators in this regime [El Karoui et ah, 
2013, El Karoui, 2013, Donoho and Montanari, 2013]. Arguably the main result 
is that while the obtained estimator is normally distributed, its variance differs 
from Huber’s classical results. Bean et al. [2013] have further shown that, unlike 
the classical p -C n scenario, the optimal M-estimator, in terms of efficiency, is not 
obtained by maximizing the log density of the errors. One more striking result 
is that for Double-Exponential errors, for k larger than approximately 0.3, least 
squares regression is superior to least absolute deviations regression. 

We argue in the current paper that M-estimation might be the wrong approach 
to the robust regression model in the p/n —> k, 0 < k < 1 regime. Our main 
motivation is as follows. When using M-estimators, the estimation error of a 
single coefficient is in the usual order of n _1 / 2 [El Karoui et ah, 2013]. However, 
the error accumulated over the coefficient vector does not vanish when n —> oo. 
We will further argue that when signal and the noise are of the same asymptotic 
order, only a few of the true coefficient values can be larger (in their absolute 
value) than n -1 / 2 . Putting it together, the estimation error of small coefficients, 
which are the majority, is larger than their actual value (in absolute value) when 
using M-estimators. Thus, apparent large effects can be actually microscopic. 

Recognizing this characteristic of the problem, shrinkage methods may be a 
better fit in this robust regression model. Shrinkage can be achieved by using either 
the aforementioned regularization methods or by using a Bayesian approach with 
appropriate scaling of the prior hyperparameters. In this paper we consider the 
latter option. 

It is well known that shrinkage can be achieved using a Bayesian methodology. 
Most of the discussion is centered in the normal error model. The James-Stein esti¬ 
mator, [James and Stein, 1961], is an empirical Bayes estimator [Efron and Morris, 
1973]; Ridge regression estimator is identical to what we get if assuming the re¬ 
gression coefficients arc iid with normal prior, and a maximum a posteriori (MAP) 
estimator is used; and if we replace the normal prior with Laplace distribution prior 
we get the Lasso. The Laplace prior for the coefficients is actively researched. An 
efficient Gibbs sampler was suggested by Park and Casella [2008]. However, the 
Lasso attracts criticism from a Bayesian point of view, since the full posterior 
distribution of the coefficients vector does not attain the same risk rate as the 
posterior mode [Castillo et ah, 2015]. Another, more recent, prior proposal is the 
horseshoe prior [Carvalho et ah, 2010, 2009], which has some appealing properties, 
at least when the design matrix is the identity [van der Pas et ah, 2014]. 

Our description above implies that the coefficients can be separated to two 
groups: small-value and large-value coefficients (in their absolute size). This per¬ 
spective aligns with existing Bayesian variable selection literature, where priors 
are assigned hierarchically. First, coefficients are separated into two groups and 
then a prior distribution is determined according to the group assignment. Often, 
though not necessarily, one of the priors is the degenerate distribution at zero. A 
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leading example for this framework is SVSS, Stochastic Search Variable Selection 
[George and McCulloch, 1993, 1997]. An alternative approach is to have a prior 
distribution on the number of non-zero coefficients, to choose these coefficients 
uniformly, and then to have a prior on the non-zero coefficients [Castillo et ah, 
2012 ], 

In this paper we suggest a full Bayesian model for the robust regression model 
when p/n k, 0 < k < 1. We choose the prior distributions and hyperparameters 
such that our prior knowledge on the design is taken into account. We then utilize 
a scale mixture of normal representation of the error distribution to construct a 
reasonably fast Gibbs Sampler. 

The rest of the paper is organized as follows. In Section 2, we present notation 
and model assumptions, and claim that M-estimation should not be used in this 
p-close-to-n regime. In Section 3, we introduce an hierarchical Bayesian model and 
then, in Section 4, we present a Gibbs sampler for parameter estimation. Detailed 
example is given in Section 5 where we also present simulation results. Section 6 
offers conclusion remarks. Proofs are given in Section 7. 
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Achilles heel of M-estimators when p/n —> k E 

( 0 , 1 ) 


We start with notations. We use || • ||, || • ||i and || • Hoc for the Euclidean norm, the 
t\ norm and the maximum norm of a vector, respectively. Throughout the paper 
we consider the model 

y(n) =X { n )p(n) +e (n), (1) 

where e ^ is a vector of i.i.d random variables with a known density function 
/ £ (-;0) characterized by 6 , an unknown parameter, X ^ is a matrix of random 
predictors, j3^ is an unknown parameter vector we wish to estimate. To improve 
clarity, we henceforth omit the superscript indicating that all model components 
given in (1) depends on n (6 excluded). We denote Xf for the i th row of X. X 
and e are assumed to be independent. We denote j3° for the true value of (3. For 
a given penalty function p, the M-estimator of /3, /3 P , defined as 


/3 P = arg min V p(Y t - Xj (3). (2) 

P “7 


If p is convex one could alternatively solve the equation 


Y,X?il>(Yi-XTl3) = 0, P:=p'. 

2—1 


Huber’s classical result [1973] is that if p 2 /n —* 0 then y/n(f3 p — (3°) is asymptoti¬ 
cally normal with a covariance matrix 


Em 

E 2 m 


lim (X T X)~ 1 

n—>-oo 
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If it is further assumed that E(e i) = 0, then by using general M-estimation theory 
it can be shown that this result holds for p/n —> 0. Portnoy [1984, 1985] derived 
consistency and asymptotic normality of M-estimators in the robust regression 
model under weaker assumptions. See also Maronna and Yohai [1981]. 

We now claim that in the model described above, a robust regression model 
where the number of predictors and the number of observations are similar, M- 
estimators have undesirable properties. But first, we present our model assump¬ 
tions: 

(Ml) lim * =«G (0,1). 

n—xx> n 

(M2) The rows of X are i.i.d N{ 0, £) for a known covariance matrix sequence 
£ = £ p . Furthermore, assume the eigenvalues of £ are bounded away from 
zero, for all p. 

(M3) i — 1, 2,..., n are i.i.d mean zero random variables with a density function 
f e . l t = log f e is concave, bounded from above and has three bounded 
derivatives, such that inf| t | <M i'[(t) > 0 for any M < oo. 

(M4) f e is symmetric and for the function g e (u) = f e (\/u) we have, for u > 0 and 

k = 1 , 2 ,..., 

(-£) 9{u) -°■ 

Assumption (M4) would be exploited when we will consider the Bayesian formu¬ 
lation of the problem. We note in passing that this assumption is fulfilled by rich 
family of distributions, such as Student’s T distribution, the Laplace distribution, 
and the more inclusive exponential power family [Andrews and Mallows, 1974, 
West, 1987], 

We now take the frequentest point of view, which we will later, in the next 
section, replace in a Bayesian perspective. When estimating a multidimensional 
parameter a loss function is needed to aggregate over the different components. A 
natural loss function is the i 2 of the estimation errors. The following proposition 
motivates our discussion. 

Proposition 1 . Let assumptions (Ml)-(M3) and the regularity assumptions needed 
for Result 1 in El Karoui et al. [2013] hold. Let j3 p be the M-estimator defined in 
(2) with respect to a ?ion-linear convex function p. Then \\fi p — /3°|| = O p ( 1) 

A proof using the results of El Karoui et al. [2013] is given in Section 7. 
Proposition 1 implies that the so-called ^-consistency cannot be achieved by 
M-estimators under this regime. 

However, we now claim M-estimators might be the wrong approach here. Con¬ 
sider specifically the statistical interesting problem arising when the signal and 
the noise are of the same asymptotic order, i.e, when Xif (3° = O p ( 1). For a mo¬ 
ment, let £ = /. Then, Xfi f3° = O p ( 1) implies ||/3°|| = O p ( 1). If £ 7 -I, then 
since £ is known, the last statement holds if taking X t = X h £~ 1//2 instead of X 
Informally, having the number of predictors in the same scale as the number of 
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observations while considering a finite signal-to-noise ratio, that does not vanish 
as n grows, implies additional assumptions on (3° structure. For example, not too 
many components of f3° can be much larger than n -1 / 2 (in their absolute value), 
otherwise the signal would be stronger than assumed. On the other hand, if /3° is 
concentrated very close to zero for all n, then the signal is too week as Xff3° is 
too small. 

M-estimation is invariant to translation of /3°, but some /3° values are less ex¬ 
pected then others, as we argued in the preceding paragraph. Therefore, another 
approach, which exploits that knowledge on the parameter vector /3° is desirable. 
Since we know that many of the true coefficients are smaller than n _1 / 2 (in their 
absolute value) we could potentially gain better estimates if we shrink some co¬ 
efficients towards zero. This can be done using regularization based methods, or 
alternatively, using a Bayesian approach in a way that shrinkage is encouraged by 
the specified prior distribution. In this paper, we choose to take the latter option, 
and in the next section we develop such a Bayesian hierarchical model. 


3 A Bayesian Model 

A Bayesian model for the robust regression involves at least one more level of 
parameters. Assume model (1) holds with a density function f € that depends on 
a parameter 0, and obeys assumptions (M3)-(M4). We start with a prior for /3. 
As we argued in the previous section, we expect many of its components to be 
small, while some are considerably larger. To accommodate for this, we present 
a mixture prior for each (3j , j = 1 with two normal mixture components. 
Denote T = (ti, ... ,t p ) e {1,2} P for the vector that indicates for each component 
j if /3j has a large variance ( tj = 1) or a small one ( tj = 2). We also denote 0 = 

p 

1 {tj = 1}) for the number of “large” components in T. Finally, let 5%, k = 
j =i 

1,2 be the variance in each of the mixture components. Putting all together, the 
following assumptions depict our prior distributions. 

(PI) The prior for /? is of an iid mixture of two zero-mean normal variables 
(tj - 1)10 ~ Ber(<j>/p ) Pj\tj, <Si, 5 2 ~ N(0,6*.) 

(P2) 4> = o p (nf log n). 

(P3) is O p (0^ 1 ) and b\ is O p (l/n^), for some fixed £ > 1. 

Assumption (P2) implies that 0 grows with n, but yet it is much smaller than p. 
Note that under assumptions (P1)-(P3) we have 

E(m 2 \^ *2, 0) = <Ml + (p - 0)^2 = O p ( 1), 

so together with the Assumption (M2) we get that X'Jf3 = O p ( 1). 


5 


Working prior distributions can be taken for 0 , <5i and 82 , adding another level 
of hierarchy to the model. In practice, the parameters of these prior distributions 
are chosen such that assumptions (P2) and (P3) are fulfilled. In Section 5, we 
present such an example. Alternatively, 0 , and 82 may be taken as known values 
that obeys Assumptions (P1)-(P3). The prior distribution for 0 as specified in 
Assumptions (P1)-(P3) reflects our knowledge on 0 when we assume the signal 
and the noise in our model are of the same order. This prior implies that only 
a small part of the coefficients can be larger than the n -1 / 2 threshold. This can 
be stated formally using Chebyshev’s inequality; See Lemma 1 in Section 7. Of 
course, one can think of other prior distributions for 0 having this property. 

As for 9, we assume a prior distribution q{9 ), where with some abuse of nota¬ 
tion, q always denotes a density, and the particular relevant density would be clear 
from its argument. Unlike other model parameters, this prior does not change with 

71 . 

Before moving to the estimation procedure, we present a theoretical result 
showing that unlike M-estimators, a Bayes estimator for 0 can achieve £ 2 —' consistency. 
As stated before, M-estimators in our regime are consistent when considering each 
coordinate of the vector separately, but not when considering the parameter vector 
as a whole. The following theorem shows that Bayesian estimator in the discussed 
model is consistent (in the Euclidean norm sense) for the parameter vector. 

Theorem 1. Consider the model (1) and assume (M1)-(M3) and (P1)-(P3). Let 
(3° be the true value of (3. Let (3* be a Bayes estimator with respect to the posterior 
distribution q((3\Y, X) and a loss function L of the form L((3,f3 °) = L(\\j3 — 0 O ||), 
for a bounded L. Then ||/3* — 0°|| A 0 as n —>■ 00, 

where A denotes convergence in probability. The proof is given in Section 7. 


4 Sampling from the posterior distribution 

Modern Bayesian statistics relies on the ability to sample from the posterior 
distribution. This may pose a challenge especially if the parameter space is 
high dimensional. Assumption (M4) aids us in construction of a Gibbs sampler 
Genian and Geman [1984], with blocked sampling available for the full conditional 
of (3. Under the condition in Assumption (M4), we can write f e as a scale mixture 
of normal distribution, Andrews and Mallows [1974], 

/.&*) = / q{A\9)da 2 (3) 

where (p is the density function of a standard normal random variable. This 
implies that while q(Y\X, /3,9) is Yl™ = 1 fe(Yi — A 'f(3]6), q(Y\X, (3,a 2 ) is the den¬ 
sity of independent normally distributed random variables with mean zero and 
variance of. The mixing distribution q(a 2 \9) can be identified in some cases 
[Andrews and Mallows, 1974, West, 1987]. If f e is the density of a Laplace distri¬ 
bution, as in the example we present in Section 5, then q(cr 2 \9) oc exp(— 9 2 a 2 /2). 
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This representation adds n parameters as an individual erf is artificially intro¬ 
duced to the model for each observation i. However, it allows for direct sampling 
from all the full conditionals of all the parameters, and a Gibbs sampler can be 
used. This Gibbs sampler resembles the one suggested for the Bayesian Lasso 
[Park and Casella, 2008], especially for the case presented in our example in Sec¬ 
tion 5 when the errors have a Laplace distribution. Note however that in their 
case the scale mixture of normals representation is taken for the prior /3 and the 
errors are normally distributed, where here we apply this representation to the 
errors, and the prior for P is a mixture of two normal distributions. 

We now present the Gibbs sampler when <f>, bj and d 2 are known hyperparam¬ 
eters. In Section 5 we demonstrate how standard conjugate priors can be used for 
these parameters in a concrete example. The Gibbs sampler iterates between the 
full conditionals of 9,af,...,af,ti,...,t p and p. Starting with 9, one can sample 
from q(9\af..., erf) oc q(9) Y\ff =l q(erf\9). The mixing distribution q(crf\9) is deter¬ 
mined by the error distribution / e (f; 6), and depending on the prior q{9), we may 
have a conjugate family (as in our Section 5 example). Alternatively, we may use 
Metropolis-Hastings step for 9 only. Since 9 is one dimensional, we expect such a 
step to marginally affect the computation time of the Bayes estimator. 

Moving to each erf, we have 

« -V ( Y, ~ X ^ ) qtf\ff) (4) 

& \ ) 


and depending on the mixture distribution this may be a known distribution to 
sample from (see again Section 5). 

Next, let T = diag(of, erf, ..., erf) and let V = diag(<%, Sf 2 , ..., Sf p ). The 
full conditional of f3 is a multivariate normal with mean /i and variance A -1 where 
A = X t Y~ x X + H -1 and qi = A~ l X T Y~ 1 Y . This is the main advantage of the 
scale mixture of normals representation for / e ; block sampling of f3 can be used. 
Finally, denote pk = P(tj = k\/3j, <f), Si, 82 ) for the full conditional of each t r We 
have 


{Pi 


P 2 ) oc 



0 

P 


Mv)— 

0-2 \S 2 J p 


We now comment on alternative posterior sampling strategies, all involve the 
Gibbs sampler described above. When the mixing distribution q(cr 2 \9) results in 
a full conditional q(af\Yi, A*, P, 9) with no available direct sampling, the proposed 
Gibbs sampler must use other MCMC procedures for sampling from the full condi¬ 
tionals of erf. Moreover, if Assumption (M4) does not hold, then the scale mixture 
of normal representation (3) cannot be used. Then, sampling from 


q(9\Y, X, P) oc q(6) J] f e (Y t - X& 9) 

i=l 

cannot necessarily be done in a direct sampling manner, but additional MCMC 
procedure, such as Metropoils-Hastings should be used. This is not, however, 
where most of the problem lies. Block sampling for P would have been based 
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upon high-dimensional full conditional of (3 


V n 

q(P\Y, x , T, S u S 2 ) (X n S-'vWj/St,) MY - #)■ 

3 =1 i=1 

but since p is large, this could be a too ambitious approach, when using Metropolis- 
Hastings or other MCMC procedures. So as a last resort, one could use a Gibbs 
sampler that uses q(/3j\Y,X,T, 81 , 62 , /3-j), where (3-j is the [3 vector without its 
j'-th component. Both the approaches we just described are expected to be more 
computationally heavy than the procedure we suggested before. We remark that 
sampling from each q(cr 2 \Yi, Xi, (3,6) can be parallelized, to shorten computation 
time since the of’s are independent given 6 and f3. 

5 Example 

We present in this section a specific example. First we describe the model with 
more detail, then we move to the estimation procedure and then we present sim¬ 
ulation results with concrete numerical values. We assume a Laplace distribution 
for the errors, that is, 

fM e ) = |exp(-0|t|). 

This implies that the mixing probability is Exponential, with q(cr 2 \9) oc exp(— 9 2 cr 2 /2) 
[Andrews and Mallows, 1974], We take an Exponential prior distribution for 9 2 
(a Gamma prior distribution would also work). The resulting posterior for 6 2 is 
a Gamma distribution with a shape parameter of n/2 + 1 and a scale parameter 

n 

1 + (X) of )/2. Moving to 8 \ and <5 2 . Let N k = 1{^ = k}. Then given a prior 

i=i 

distribution q( 8 k ), the full conditional for 8 k is simplified to 

q(5 k \/3,T) oc (Si) 2 1 exp I f3 2 | q( 8 k ). 

\ k 1 fd= fc } / 

We take here a conjugate Inverse-Gamma prior distribution for S k with a shape 
parameter a k and a scale parameter j k . The resulting posterior is Inverse-Gamma 
with parameters a' k = a k + N k /2 and 7 a, = 7 k + E An alternative 

1 {tj=k} 

prior distribution here would be the Jeffreys improper prior q(S k ) oc 1 jS\. Next, a 
standard Beta(a,p, 7 ^) prior is taken for <f>/p so the resulting posterior is Beta(a r j,+ 
Xi,^ ( j > J rN 2 ). We note here that the values taken for a l5 a 2; 7i, 72 and 7 ^ should 

be on the background of Assumptions (P2)-(P3). These are working priors that 
reflect our assumptions on f3. The dimension of [3 grows with n, while its ^ 2 norm 
is assumed to be constant. Therefore, 0, Sf and 5% must change with n. This 
results in a working priors that also change with n. 

Sampling from the full conditionals of (3 and T remains the same as described 
in Section 3, and is generally the same regardless of / £ , as long as the scale mixture 


of normal distributions representation (3) is used. The same property holds for 
the sampling from the full conditionals of hi, S 2 and 0 as described above. 

Moving to of, substituting q(a 2 \9) oc exp(— 9 2 cr 2 /2) in (4), we have 

Similarly to Park and Casella [2008], this implies an Inverse-Gaussian distribution 
for a~ 2 with parameters a = 9 2 and b = 6 /\Yj —Xf/3\, where the Inverse-Gaussian 
distribution defined as having the density 


fit) 



a(l — b) 2 
2 b 2 t 


1 {t > 0} 


Chhikara [1988]. The full conditionals we just described are used in the proposed 
Gibbs sampler. We now turn to the presentation of simulation study results. 


5.1 Simulation Study 

We present simulation results for various hi values and for n = 500, 2500. We 
used 500 and 1000 iterations for the Gibbs sampler described above. Data were 
simulated in the following way. First, X was simulated under E = /, i.e., each 
of the iicl rows of X was simulated as N p (0,I). Then, we simulated 0, hi and h 2 . 
We used the values { a$ = 30,0^ = 30(3/ilog(n) — l)},{aq = 2,71 = log(n)/n} 
and {a 2 = 2, y 2 = n -L5 } for the prior distributions of 0/p, hj and h|, respectively. 
These values were chosen so E{cj)) = n/ (3 log(n)), E(S{) = log (n)/n, E(5 2 ) = n^ 1 " 5 
and also to have prior distributions that are not too concentrated around their 
means. Given 0, T was simulated as iid 5er(0/p). Then, we simulated /3 as 
independent normally distributed variables where the prior variance of each j3j is 
hf. Next, we simulate 6 from an Exponential prior q{9) = exp(— 6)1{6 > 0}, and 
simulate e as iid random variables that follow Laplace distribution with parameter 
9. Finally, the observed data is {X, Y}, with Y = Xf3 + e. 

We compared the Bayesian estimator to classical M-estimators, namely least 
squares and least absolute deviations estimators. We also considered standard 
regularization-based estimators. Let the objective function to be minimized writ¬ 
ten as 

n 

/(« = !>«-+ ACS) 

i=l 

where P\ is a regularization function and A is a tuning constant chosen here 
by cross-validation. We considered the following four estimators, defined by 
the choice of p and P\. Least squares Lasso (p(x) = x 2 ,P\((3) = ||0||i), least 
squares Ridge regression ( p(x ) = x 2 ,P\(/3 ) = ||/3|| 2 ), least absolute deviation 
Lasso ( p(x ) = \x\,P\(/3) = ||0||i) and least absolute deviation Ridge regression 
(p(x) = |x| ,P\(/3) = ||/3|| 2 ). The Erst two estimators were obtained using the 
algorithm described in Friedman et al. [2010] and the latter two estimators were 
calculated using the algorithm described in Yi and Huang [2015] (hqreg package 
in R). 
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Table 1 presents medians of \\/3 — /3|| 2 across 1000 simulations for the estimators 
we just described. We report the medians and not the means due to a small 
number of outliers, observed mainly for the regularization based methods. For 
each method, the £ 2 error grew with k, (or with p) for a fixed n, as one may have 
expected. The superiority of the proposed Bayes estimator is clearly shown. For 
n = 500 taking 1000 iterations of the Gibbs sampler resulted only in minor gain 
comparing to using the Gibbs sampler with 500 iterations. For n = 2500, taking 
1000 iterations was essential to improve this estimator’s performance. 

Considering Proposition 1, the relatively poor performance of the non regu¬ 
larized M-estimators was expected. As expected, for n > 0.3 the least absolute 
deviation (LAD) estimator was preferable over the least squares (LS) estimator. 
Among the regularization based methods, however, taking the least absolute devi¬ 
ation as a penalty function remained superior for all k values, both for Lasso and 
Ridge regression. Comparing the Lasso and Ridge regression, the former showed 
better performance for a fixed penalty function. Bayesian methods are often time 


Table 1: Medians values of \\/3 — f3 1| 2 across 1000 simulations for the proposed 
Bayes estimator using 500 and 1000 iterations of the Gibbs sampler (Bayes 50 o 
and Bayesiooo), the Lasso with square (LASls) and absolute (LASlad) deviations 
penalty functions, Ridge regression with square (RIDls) and absolute (RIDlad) 
deviations penalty functions and the standard least squares (LS) and least absolute 
deviations (LAD) estimators 


K 

Bayes 50 o 

Bayesiooo 

LASls 

n = 500 
LAS lad 

RIDls 

RID lad 

LS 

LAD 

0.1 

0.080 

0.083 

0.119 

0.103 

0.115 

0.099 

0.328 

0.252 

0.2 

0.104 

0.095 

0.127 

0.115 

0.136 

0.128 

0.643 

0.584 

0.3 

0.118 

0.111 

0.145 

0.130 

0.163 

0.154 

1.192 

1.216 

0.4 

0.129 

0.129 

0.165 

0.150 

0.175 

0.168 

1.955 

2.193 

0.5 

0.141 

0.137 

0.173 

0.161 

0.195 

0.187 

2.877 

3.367 

0.6 

0.139 

0.148 

0.178 

0.166 

0.197 

0.188 

4.572 

5.449 

0.7 

0.151 

0.147 

0.181 

0.166 

0.199 

0.194 

7.319 

9.139 

0.8 

0.154 

0.155 

0.191 

0.178 

0.213 

0.207 

11.830 

16.177 

0.9 

0.162 

0.155 

0.192 

0.177 

0.211 

0.206 

24.849 

31.944 

0.95 

0.169 

0.168 

0.203 

0.190 

0.228 

0.222 

58.334 

73.344 





n = 2500 





K 

Bayes 50 o 

Bayesiooo 

LASls 

LAS lad 

RIDls 

RID lad 

LS 

LAD 

0.1 

0.077 

0.073 

0.098 

0.081 

0.100 

0.087 

0.301 

0.231 

0.2 

0.099 

0.097 

0.124 

0.108 

0.141 

0.129 

0.722 

0.647 

0.3 

0.120 

0.104 

0.128 

0.112 

0.157 

0.146 

1.154 

1.165 

0.4 

0.127 

0.120 

0.145 

0.128 

0.171 

0.163 

1.985 

2.119 

0.5 

0.143 

0.119 

0.139 

0.126 

0.167 

0.161 

3.056 

3.569 

0.6 

0.152 

0.132 

0.152 

0.141 

0.183 

0.178 

4.245 

5.142 

0.7 

0.155 

0.138 

0.155 

0.142 

0.184 

0.178 

6.737 

8.396 

0.8 

0.172 

0.146 

0.162 

0.149 

0.193 

0.191 

12.157 

15.799 

0.9 

0.166 

0.150 

0.165 

0.153 

0.195 

0.188 

26.930 

35.734 

0.95 

0.180 

0.157 

0.171 

0.156 

0.208 

0.204 

58.033 

72.705 
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consuming, especially when the target parameter is high dimensional and sampling 
from full conditionals is performed. The Gibbs sampler presented in this section 
involve direct sampling for all the parameters, with blocked sampling for /3, with¬ 
out using additional MCMC steps. Table 2 compares median computation time in 
minutes between the different methods. Regularized least squares penalty function 
methods were considerably faster that the alternatives (excluding simple LS and 
LAD). However, as Table 1 suggests, they were also inferior to the other methods. 
The suggested Bayesian Gibbs sampler was comparable, and often preferable, in 
terms of computation time to the regularized least absolute deviations estimators. 


Table 2: Medians computation times (minutes) across 1000 simulations for the 
proposed Bayes estimator using 500 and 1000 iterations of the Gibbs sampler 
(Bayessoo and Bayesiooo), the Lasso with square (LASls) and absolute (LASlad) 
deviations penalty functions, Ridge regression with square (RID^s) and absolute 
deviations (RID lad) penalty functions and the standard least squares (LS) and 
least absolute deviation estimators (LAD) 


K 

Bayesiooo 

LASTS' 

n = 500 
LAS lad 

RID LS 

RID lad 

LS 

LAD 

0.2 

1.4 

0.0 

0.6 

0.0 

0.7 

0.0 

0.0 

0.4 

2.2 

0.0 

3.7 

0.0 

4.5 

0.0 

0.0 

0.6 

2.8 

0.0 

6.9 

0.0 

6.7 

0.0 

0.0 

0.8 

5.9 

0.1 

20.7 

0.1 

18.7 

0.0 

0.0 

0.9 

7.6 

0.2 

28.2 

0.1 

22.8 

0.0 

0.0 

0.95 

8.5 

0.1 

34.2 

0.1 

26.7 

0.0 

0.0 




n = 2500 




K 

Bayesiooo 

LASls 

LASlad 

RIDls 

RIDlad 

LS 

LAD 

0.2 

12.2 

0.2 

9.2 

0.3 

11.0 

0.0 

0.6 

0.4 

74.6 

0.7 

73.8 

0.9 

91.8 

0.1 

3.0 

0.6 

39.7 

0.5 

56.4 

0.5 

65.3 

0.1 

2.5 

0.8 

98.2 

1.5 

141.2 

1.0 

156.2 

0.2 

3.9 

0.9 

137.8 

3.0 

190.9 

1.2 

205.9 

0.2 

4.2 

9.5 

396.6 

2.6 

353.4 

1.4 

385.8 

0.6 

4.6 


6 Discussion 

This paper provided a Bayesian alternative to frequentist robust regression when 
the number of predictors and the sample size are of the same order. Standard 
M-estimators are inconsistent when considering the error accumulated over the 
vector. If it is further assumed that signal and the noise are of the same asymp¬ 
totic order, then shrinkage of many coefficients is desirable. We presented an 
hierarchical prior for model parameters and constructed a Bayes estimator suit¬ 
able for the problem. A scale mixture of normal distribution representation for 
the errors’ distribution allowed us to build an efficient Gibbs sampler with blocked 
sampling for the coefficients. Theorem 1 shows that under appropriate conditions, 
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the Bayes estimator in this problem is consistent in the £2 sense. This property 
does not hold for M-estimators in this design. 

The Bayesian estimator also outperforms regularization based methods. Al¬ 
though these methods, at least for the quadratic penalty function, can be com¬ 
puted much faster, they provide point estimate only, where the Gibbs sampler 
provides the entire posterior distribution that can be used to inference. This issue 
was also pointed out in Park and Casella [2008]. It was previously shown that the 
asymptotic distribution of M-estimators in this design is nontrivial [El Karoui et ah, 
2013, Donoho and Montanari, 2013, El Karoui, 2013]. In El Karoui [2013], the 
distribution of the M-estimator was studied as the limit of a Ridge regularized es¬ 
timators. As standard asymptotic theory does not apply here, the full distribution 
of the Bayes estimator in this regime remains as a future challenge. 

7 Proofs 

7.1 Proof of Proposition 1 

First, by Lemma 1 in El Karoui et al. [2013] we can write 

= \\P p ’ simp \\Z ~ 1/2 u 

T> 

where = denotes equality in distribution, and where u is a p-length vector dis¬ 
tributed uniformly on the sphere of radius one and where 

n 

pp,simp = arg min y- 

with XT = E ~ l / 2 Xi being a mean zero multivariate normal vector with the iden¬ 
tity matrix as its variance. Their lemma also asserts that and u are 

independent. By Result 1 in El Karoui et al. [2013], ||/3 p,s * mp || has a determinis¬ 
tic limit denoted here (and there) by t p {k). They further show how t p (k) can be 
found. Next, by a multivariate central limit theorem, for large enough p, yJpYr x ^ 2 u 
is approximately N p ( 0, E~ x ). Thus, for large enough n we have 

\\Vn0 p ~ £°)|| = rj(«)*r 1 ||a;|| 2 + o p (l) (5) 

where x ~ N p ( 0, E -1 ). Let E = TAT t be the spectral decomposition of E. For 

T) 

v ~ iVp(0, 1) and v = v*, we have 

IMI 2 = v t TA~ 1 T t v = ( w *) r A"V \~lxl 

where > st • symbolizes larger in the stochastic sense, A m in is the minimal eigen¬ 
value of E, and Xp is a Chi-squared random variable with p degrees of freedom. 
Therefore, the right hand side of (5) is O p (n) and we are done. 
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7.2 Lemma 1: Presentation and proof 

Denote B^ ,v := ^ JT 1 {\j3j\ > Cn _,?//2 } for the proportion of coordinates of /3 
that are of order larger than Cn _r?//2 . The following lemma ensures us that if the 
prior distribution of (3 admits Assumptions (P1)-(P3) then B^ ,v is not far from 
c f>/p , and consequently, Bf ,v = O p (<f>/p) = Opilog^n)^ 1 ). Thus, the proportion of 
“large” coefficient values out of the total number of coefficients goes to zero with 
probability that goes to one as n grows. 


Lemma 1. Let assumptions (Ml), (PI) and (P2) hold. Assume (P3) holds with 
some f . Then, for all 1 < rj < for all ( > 0 and for any constant C we have 


lim P 

n—>oo 



o 


Proof. First note that Bf' ,ri is a mean of p independent Bernoulli random 
variables with success probability of 


c, v - 


= 


-2 
p [ 


$ 


C 


= - 2 $ 
p 


c 


+ <f> 


CJ 


y/ni6% 




+ o(- 

p 





with $ being the CDF of a standard normal random variable. The second equality 
results from n^d\ = n v ~^ (Assumption (P3)) and since rj < £. Now, this problem 
is symmetric, so it is suffice to show 


lim p(b^’ v > — (1 + £)^ — 0. 

n—> oo \ p / 

By Chebyshev’s inequality we have 


p(b°" > ^(1 + 0) < P(\B°" - + 0 - I'n’") 


< 


V, 


c, v 


Mh 1 +o - 


-2f 


c 




+ o 


y (1 + C - 2$ 


c 




+ o[Z- 

’ p 


and the last expression goes to zero as n —> oo. 


□ 


7.3 Proof of Theorem 1 

For the simplicity of the proof we will assume that T, — I. 


13 




















Let f3 MAP = argmax^ q(/3\Y, X) be the MAP estimator. First we will show 
that \\(3 MAP — /5°|| = o p (l), and then that \\f3 MAP — /3*\\ = o p (l). We 

Starting with (3 MAP , will hrst show the weaker result \\(3 MAP — /3°|| = O p (l). 
We will then build on this result to strengthen the conclusion. For any vector f3, 
we partition /3 to two subvectors /3m and $m c , were M is the subset of relatively 
large coefficients. We will then use the fact that ||/3jvf AP — /3° M \\ — O p ( 1) to argue 
that ||/3j^f p — /3mc\\ — o P (l), and the latter to argue that ||/3j^ AP — /3m\\ — o p (l). 
Now, for the details. 

By Taylor expansion, the log-posterior of /3 can be written as 


log q(P\Y, X) 

n p 


i= 1 
n 


3 =1 


Y 4 (ei - xf (/3 - /3 °)) + Y 


i= 1 


3 =1 


Y ^( e *)+Y lo s(^(^i)) - 0 3 - z3 o r Y Xi ^ ( e * 


2=1 


J = 1 


2=1 


( 6 ) 


(7) 


+\{p- u°) T Y £ " ( e * + ^ - 0°))) (z 3 - z?° 


2=1 


for some ap G [0,1]. 

Since the empirical distribution function converges to the cumulative distribu¬ 
tion function there are M < oo and 7 < 1 such that 


p (L 1(M > M) < jn) ->• 1. 

Next we want to argue that we also have 

P(Y 1(|q - X P 0 MAP ~(3°)\> M) < 771 ) 1. 

Let U n ~ W if U n = O p (y n ) and 14 = O p (U n ). Now 

/•OO 

A|£(e) - £(0)| = 2 / |4a;) - 40)|4 (x W 

Jo 

/•oo 

= 2 / x|4(a a ,a;)|e f ^(ia: 

Jo 

/•OO 

<2 / x|4(x)|e^ x Zdx 

Jo 

/•OO 

= 2 / e t{x) dx = 2. 


( 8 ) 


(9) 


( 10 ) 


Hence n 1 XT=i 4 e *) ^ 7?4e) > — Thus, ^4 e 4 ~ n. Since f3 MAP is the 
maximizer we have that g(/3 MAP |Y, A") — g(/3°|Y,X) > 0- Consider this difference 
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and recall also line (6). Consider first the difference in the prior contributions. 
If tj = 2 (see Assumption (PI)) then the contribution of the prior can be made 
larger by making (3j smaller, but then \(3j — f3j\ = O p (n~ ^ 2 ). If tj = 1 we can 
increase the prior contribution, but there are only O p {n/ log n) such terms. Thus, 

I Y7j= 1 l°g (<?(/^ MAP )) _ Y7j= 1 l°g(<3' (/3y)) I = O p (n/ logn). Since f3 MAP is improving 
over f3° and YY ^( e 0 ~ n we must have that for some A > 0, Yj 0 MAP — 

f3 0 )) > —An with high probability. Since liut^oo = — oo, we can choose a finite 
M such that both (8) and (9) hold. Let 

A = {i : |ej| < M and |ej — Xf (/ 3 MAP — /3°)| < M}. 

We conclude from (8) and (9) that for large enough M, |Al| is the same order as 
n. Hence 


V + ctpMApXj (I3 map - /3°)))X,ACf < C(M) X t Xj (11) 

i= 1 A 

in the partial order of positive semi-definite matrices. We conclude from the prior, 
(7), and we can verify by (11) that 

\\B MAF - yil = O p (AvA) = 0,(1). (12) 

We now build on this result to strengthen the conclusion. Let A4 C {1 ,... ,p} be 
the set of indices such that \/3j\ > n“^logn. Denote by (3° M an d $j^ AP subvec¬ 
tors with indices in A4 of (3° and j3 MAP , respectively. Let A4 C the complementary 
set and define the corresponding subvectors and submatrix similarly. 

We start with estimation error of f3j^ AP . Let R = {/3 : || (3m c ~ /3%c || = 
n -7 and (3m = /3j^ AP }i where 0 < 7 < £ — 1 and consider 

W = max log q(/3\Y, X) - logq(/3 0 ’ MAP \Y,X), (13) 

/3£R 

where /3°’ MAP equals to (3° on M c and to [3 MAP on AT. Substituting (7) in (13), the 
second term on the RHS of (7) contributes to W — n^n~ 21 , the third term is at most 
O p (n -7+1 ), while the last term is by ( 11 ) and ( 12 ) at most O p (?i -27+1 ) + O p (?i -7+1 ). 
Thus W < 0, and since f3^ AP is the maximizer, \\^j^ AP — /3j ^ c || = o p (l). 

This means that the error in estimation /3m c has negligible contribution. The 
estimating equation for /3m with only the t — 1 part of the prior is a ridge regression 
with |AT | = o p (n) variables with negligible ridge penalty, thus standard mean 
calculations and Assumption (Ml) would show that ||/3j^ AP — f3° M \\ = o p (l). By 
Assumptions (P1)-(P3) the t — 2 part has a relative peak at 0 of height of the 
ratio of the two standard deviations (Assumption (P3)) and the ratio of the mixing 
probabilities (Assumption (PI)). It contributes the log of the product of these two 
terms, O(logn), but only for components which are in an 0(n ~^/ 2 ) of 0. But the 
curvature of the likelihood is O p (n ) and the corresponding components of (3° are 
1 ^ 2 ), thus the gain in the prior cannot balance the loss in the likelihood 
which is of order per component which is this neighborhood of 0 . 
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Having established that the MAP estimator is in o(l) neighborhood of the 
truth, we consider now (3*. Again, consider the partition to /3m an d /3m c - The 
log-prior of any (3j is 


log 



e C1< ^ 2 + c^ 2 n^^ 2 e 



5 


where Ci,c 2 = 0(1). Thus, for \^ AP \ < L n n~ for L n oo slowly, the 
central component of the prior dominates, and the a posteriori mass of this ball 
is negligible. For (3j /IAP « 0 -1 / 2 , we argued that the peak at 0 is much smaller 
than the value at (3^ AP . Since the radius of this peak is of order n~C It can be 
ignored. Ignoring this neighborhood of 0, the a posteriori is strictly concave with 
maximum at (3(S AP , and curvature of order n and hence (3* — (3 AIAP = O p {n~ l t 2 ). 

□ 
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